Statistical Thinking About Home Run Hitting

Jim Albert, Emeritus Professor, BGSU

2024-04-01

Introduction

My Background

  • Grew up in Philly area playing baseball, basketball and tennis.
  • Followed the Philadelphia Phillies
  • Good at math – led to a doctorate in Statistics.
  • Prof at BGSU, got tenured, and started working on sports problems

Early Work in Statistics & Sports

Frederick Mosteller

  • A pioneer in using statistics to analyze an array of topics as disparate as anesthesia, presidential elections and baseball
  • Worked on problems in baseball, football and golf
  • What is the chance that the better team wins the World Series?

Early Work in Statistics & Sports

Carl Morris

  • Explored applications of statistics in a variety of sports.
  • Started Harvard Sports Analysis Collective (HSAC)
  • Famous baseball study by Brad Efron and Carl Morris illustrating multilevel estimation

Efron and Morris Graph

Why Should a Statistician Look at Sports Problems?

  • Sports is a great way to communicate statistical thinking.

  • Knowing both the sports application and the theory helps the research.

  • In sports, we learn how people in the world interpret and use data.

My Baseball Research

  • Career trajectories of performance
  • Streakiness patterns in home run hitting
  • Recent patterns of home run hitting (Is it the ball?)

Part I: Career Trajectories

Comparing Mickey Mantle and Hank Aaron

  • Who was the better home run hitter?

  • Look at some career statistics:

Player          Home Runs     PA   HR Rate (%)
Mickey Mantle         536   9910          5.41
Hank Aaron            755  13941          5.42
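As a quick check on the table, the rate column is home runs per 100 plate appearances. A minimal sketch that recomputes it from the career totals shown (the function and variable names are illustrative only):

```python
# Career totals taken from the table above.
careers = {
    "Mickey Mantle": {"hr": 536, "pa": 9910},
    "Hank Aaron": {"hr": 755, "pa": 13941},
}

def hr_rate_pct(hr, pa):
    """Home runs per 100 plate appearances."""
    return 100 * hr / pa

# Round to two decimals to match the table.
rates = {name: round(hr_rate_pct(c["hr"], c["pa"]), 2) for name, c in careers.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

The two career rates are nearly identical, which is why career totals alone do not settle the comparison.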

Mantle’s Career Trajectory

Fit a Parabola

Define Peak and Peak Age

Distinguish Two Types of Ability

  • Distinguishing home run performance and home run ability

  • Peak Ability – what is a player’s home run ability at his peak?

  • Career Ability – what is a player’s career home run ability?

Address by a Multilevel Model

  • HR counts are binomial with probabilities \(p_j\)
  • Probabilities satisfy logistic regression model \[\log \left(\frac{p_j}{1 - p_j}\right) \sim N(\mu_j, \sigma)\]

where the means satisfy a quadratic model

\[\mu_j = \beta_0 + \beta_1 \, AGE_j + \beta_2 \, AGE_j^2\]

Fit this Multilevel Model

  • Posterior estimates shrink observed HR rates towards the parabola

  • Get posterior of a player’s peak ability \[ PEAK = \max_j \{p_j\} \]

  • Get posterior of age that achieves peak
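To make the peak quantities concrete, here is a minimal sketch that evaluates the quadratic mean trajectory on the logit scale and finds its vertex. The coefficient values are hypothetical, chosen only for illustration, and this looks at the peak of the mean curve \(\mu\) for a single set of coefficients. It is not the posterior of \(\max_j \{p_j\}\), which would require repeating a calculation like this over posterior draws.

```python
import math

def logit_trajectory(age, b0, b1, b2):
    """Quadratic mean trajectory on the logit scale: b0 + b1*age + b2*age^2."""
    return b0 + b1 * age + b2 * age ** 2

def inv_logit(x):
    """Map a logit back to a probability."""
    return 1 / (1 + math.exp(-x))

def peak_age(b1, b2):
    """Vertex of the parabola; it is a maximum only when b2 < 0."""
    if b2 >= 0:
        raise ValueError("b2 must be negative for the trajectory to have a peak")
    return -b1 / (2 * b2)

# Hypothetical coefficients (not fitted values from any player).
b0, b1, b2 = -7.70, 0.34, -0.006

age_star = peak_age(b1, b2)
peak_prob = inv_logit(logit_trajectory(age_star, b0, b1, b2))

print(f"peak age: {age_star:.1f}")
print(f"HR probability at peak: {peak_prob:.4f}")
```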

At What Age Do Ballplayers Achieve Peak Performance?

  • Bill James 1982 Baseball Abstract “Looking for the Prime”

  • James: “Any successful statistical analysis of aging must find some way to deal with the ‘white space’”

  • This is a missing data problem

Career Trajectories of Mantle and Five Similar Players

Multilevel Model

  • Sampling: Binomial sampling where the probabilities follow a logistic parabolic model

  • Prior: Sets of regression coefficients follow a multivariate normal distribution

  • Model fits “borrow strength” to get improved estimates of the fitted trajectories

  • Improved estimates of peak ages – players tend to peak about age 28

But …

  • 28 is a poor estimate of peak age of all MLB players.

  • Should account for the missing data.

  • Schuckers, Lopez and MacDonald (2024) illustrate estimating player aging curves using regression and imputation.

Part II: Exploring Streaky Patterns

Problem: Looking for True Streakiness

  • Collect hitting data for all players in a season
  • Focus on patterns of hot and cold hitting
  • Is there evidence that players are truly streaky?
  • Or maybe we are observing coin-flipping behavior

Binary Sequence

  • Observe sequence of hitting data for a player
  • For each plate appearance, observe a HR (1) or not (0)
  • Focus on pattern of streaks and slumps
  • Look at spacings, the number of “failures” between consecutive home runs

Example - Mike Schmidt

  • In 1980 season, Schmidt hit 48 home runs on these PAs:

25 32 41 45 72 76 86 87 100 131 141 150 160 162 176 178 182 187 221 228 269 301 316 339 342 343 368 406 414 420 425 433 454 455 473 522 540 554 578 588 596 598 604 616 637 640 645 652

  • Spacings are 25, 7, 9, 4, 27, …
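The listed spacings can be reproduced from the plate appearance numbers. A minimal sketch: the values 25, 7, 9, 4, 27 are differences between consecutive home run PAs (the first measured from the start of the season), so each one includes the PA on which the home run was hit. Subtracting one gives the count of failures between home runs; which convention feeds the geometric model is an assumption to fix before fitting.

```python
# PA numbers of Schmidt's 48 home runs in 1980, from the list above.
hr_pa = [25, 32, 41, 45, 72, 76, 86, 87, 100, 131, 141, 150, 160, 162, 176,
         178, 182, 187, 221, 228, 269, 301, 316, 339, 342, 343, 368, 406,
         414, 420, 425, 433, 454, 455, 473, 522, 540, 554, 578, 588, 596,
         598, 604, 616, 637, 640, 645, 652]

# Differences between consecutive home run PAs (first one measured from PA 0).
spacings = [b - a for a, b in zip([0] + hr_pa[:-1], hr_pa)]

# Number of non-home-run PAs strictly between consecutive home runs.
failures = [s - 1 for s in spacings]

print(spacings[:5])
print(failures[:5])
```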

Plan

  • Need a measurement of streakiness
  • Consider two probability models - Consistent and Streaky
  • Construct a Bayesian measure to distinguish between the models

Geometric Model

  • Let \(y_1, ..., y_n\) denote the observed spacings.
  • Assume \(y_j\) are independent \[ y_j \sim Geometric (p_j) \]
  • Put models on the probabilities \(p_1, ..., p_n\)

Two Models

  • Model \(C\): Hitter is truly consistent

\[p_1 = ... = p_n = p\]

  • Model \(S\): Hitter is truly streaky

the \(p_j\) are different and distributed according to a Beta curve

Bayes Factor

  • Ratio of the marginal probabilities of the observed spacings under the two models.
  • Bayes factor in support of streaky \(S\) is \[ BF = \frac{f(y | S)}{f(y | C)} \]
  • If \(\log BF > 0\), support for true streakiness.
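A minimal sketch of a Bayes factor of this type, under simplifying assumptions that are not necessarily the ones used in the talk: spacings are counted as failures before each home run, Model \(C\) puts a single Beta(a, b) prior on the common \(p\), and Model \(S\) draws each \(p_j\) independently from the same Beta(a, b). With these choices both marginal likelihoods have closed forms in terms of the Beta function. The default a = b = 1 is a placeholder, not a recommended prior.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_consistent(failures, a, b):
    """log f(y | C): one p shared by all spacings, p ~ Beta(a, b)."""
    n = len(failures)
    return log_beta(a + n, b + sum(failures)) - log_beta(a, b)

def log_marginal_streaky(failures, a, b):
    """log f(y | S): a separate p_j ~ Beta(a, b) for every spacing."""
    return sum(log_beta(a + 1, b + y) - log_beta(a, b) for y in failures)

def log_bayes_factor(failures, a=1.0, b=1.0):
    """log BF in support of the streaky model S over the consistent model C."""
    return (log_marginal_streaky(failures, a, b)
            - log_marginal_consistent(failures, a, b))

# First five Schmidt spacings, expressed as failures before each home run.
example = [24, 6, 8, 3, 26]
print(log_bayes_factor(example))
```

The result depends heavily on the prior: a diffuse Beta under \(S\) penalizes the streaky model, so any real analysis would need a more carefully chosen prior than this placeholder.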

Focus on Hitters on Career Home Run Leaderboard

  • Look at spacings between HR for each season
  • Plot log BF against Season
  • Are there any sluggers who show interesting streaky patterns?

log Bayes Factors - Hank Aaron

log Bayes Factors - Albert Pujols

Example of a Streaky Season

Overall Streakiness?

  • Maybe players are truly consistent and we are observing “chance” streakiness due to multiplicity.

  • Are these streaky outcomes a result of a consistent model for hitting home runs for all players?

Predict Data from Consistent Model

  • Estimate home run probabilities \(P_1, ..., P_N\) for \(N\) players using an exchangeable model.
  • Simulate binary outcomes from Bernoulli distributions using these probability estimates.
  • Find the number of players where there is support for streakiness (using Bayes factor).
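The simulation steps above can be sketched as follows. Everything specific here is an assumption for illustration: the player probabilities are hypothetical placeholders rather than estimates from an exchangeable model, each player gets the same number of plate appearances, and the Bayes factor is a simplified Beta-based version with a placeholder prior.

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_bf_streaky(failures, a=1.0, b=1.0):
    """Simplified log Bayes factor (streaky vs. consistent), Beta(a, b) priors."""
    n = len(failures)
    log_s = sum(log_beta(a + 1, b + y) - log_beta(a, b) for y in failures)
    log_c = log_beta(a + n, b + sum(failures)) - log_beta(a, b)
    return log_s - log_c

def simulate_failures(p, n_pa, rng):
    """Simulate n_pa Bernoulli(p) plate appearances; return failures before each HR."""
    failures, run = [], 0
    for _ in range(n_pa):
        if rng.random() < p:   # home run
            failures.append(run)
            run = 0
        else:
            run += 1
    return failures

def count_streaky(probs, n_pa, rng):
    """Number of simulated players whose log Bayes factor supports streakiness."""
    count = 0
    for p in probs:
        y = simulate_failures(p, n_pa, rng)
        if len(y) >= 2 and log_bf_streaky(y) > 0:
            count += 1
    return count

rng = random.Random(2024)
probs = [0.03, 0.04, 0.05, 0.06, 0.07]   # hypothetical HR probabilities
sim_counts = [count_streaky(probs, 600, rng) for _ in range(200)]
print(min(sim_counts), max(sim_counts))
```

The observed count of streaky players would then be compared with the distribution of `sim_counts`.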

Predictive Simulation Results

  • Find more streaky players in the observed data than one would predict based on simulations from a consistent model.

  • So patterns of home run streakiness are “interesting”.

  • Raises question: Do truly streaky hitters exist?

Part III: Understanding Surge in Home Run Hitting

HR Totals in the Statcast Era

Season Home Runs
2015 4909
2016 5610
2017 6105
2018 5585
2019 6776
2021 5944
2022 5215
2023 5868

Focus on In-Play Rates

  • Define the home run rate as the fraction of \(HR\) among all batted balls (\(AB - SO\)) \[ HR \, Rate = \frac{HR}{AB - SO} \]

  • Look at history of \(HR\) rates

History of In-Play Home Run Rates

What is Causing the Rise in Home Run Rates?

  • In the fall of 2017, Major League Baseball charged a committee with identifying the potential causes of the increase in the rate at which home runs were hit from 2015 to 2017.

  • Committee released two reports (May 2018 and December 2019)

Possible Reasons for Increase in HRs

The batters?

  • Changes in characteristics of batted balls
  • Launch angle, exit velocity, and spray angle

The pitchers?

  • Changes in types of pitches
  • Pitch location

Possible Reasons for Increase in HRs

The ball?

  • Changes in how the ball is made?
  • Seam height, core?
  • Drag coefficient (resistance of ball as it travels)?

Possible Reasons for Increase in HRs

Game conditions?

  • Ballpark effect
  • Weather
  • Cold vs. hot temperatures

Process of Hitting a Ball

  • IN-PLAY: Have to put the ball in play

  • HIT IT RIGHT: The batted ball needs to have the “right” launch angle and exit velocity

  • REACH THE SEATS: Given the exit velocity and launch angle, needs to have sufficient distance and height to clear the fence (the carry of ball)

Recent Exploration of Home Run Rates

  • Nine seasons of Statcast data (2015 - 2023) are available
  • Have launch speed and launch angle measurements for all seasons
  • Take a broader perspective on home run hitting

Empirical Approach

  • Look at region of launch angle and exit velocity where most home runs are hit
  • Look at rate of batted balls in this region – how does it vary by season?
  • Look at rate of home runs for balls hit in this region – how does it vary by season?
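A minimal sketch of the two empirical rates, assuming each batted ball is recorded as (season, launch angle, launch speed, home run flag). The zone boundaries below are hypothetical placeholders, not the region used in the talk, and the toy rows are made up purely to exercise the code.

```python
from collections import defaultdict

# Hypothetical zone bounds, for illustration only.
LA_MIN, LA_MAX = 20.0, 35.0   # launch angle, degrees
LS_MIN = 100.0                # launch speed, mph

def in_red_zone(la, ls):
    return LA_MIN <= la <= LA_MAX and ls >= LS_MIN

def red_zone_rates(batted_balls):
    """Per season: share of batted balls in the zone, and HR share within the zone."""
    tallies = defaultdict(lambda: {"bip": 0, "likely": 0, "hr_likely": 0})
    for season, la, ls, is_hr in batted_balls:
        t = tallies[season]
        t["bip"] += 1
        if in_red_zone(la, ls):
            t["likely"] += 1
            t["hr_likely"] += int(is_hr)
    rates = {}
    for season, t in tallies.items():
        rates[season] = {
            "bip_rate": t["likely"] / t["bip"],
            "hr_rate": t["hr_likely"] / t["likely"] if t["likely"] else float("nan"),
        }
    return rates

# Made-up rows: (season label, launch angle, launch speed, home run?)
toy = [
    ("S1", 28, 105, True), ("S1", 25, 101, False),
    ("S1", 10, 95, False), ("S1", 45, 90, False),
    ("S2", 30, 103, False), ("S2", 27, 108, True),
    ("S2", 5, 99, False), ("S2", 22, 102, False),
]
toy_rates = red_zone_rates(toy)
print(toy_rates)
```

Separating the two rates is the point of the design: the first reflects what hitters do, the second reflects how far a well-hit ball carries.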

Launch Vars Where Most HR are Hit (RED Zone)

Balls in Play Rate

  • Interested in rate of “home run likely” (RED Zone) batted balls \[ BIP \, Rate = \frac{HR \, Likely}{BIP} \]
  • Are batters changing their approach?
  • Players getting stronger?

Rate of Balls Hit in RED Zone

Rate of Balls Hit in RED Zone

  • See a general increase in “home run likely” rates over Statcast period
  • Players appear to be changing their hitting approach or they are getting stronger

Home Run Rate in RED Zone

  • What is the chance of a home run given good values of launch angle and exit velocity? \[ HR \, Rate = \frac{HR} {HR \,Likely} \]
  • Characteristic of the baseball
  • Changes in drag coefficient over seasons?

Home Run Rate in RED Zone

Home Run Rate in RED Zone

  • General increase from 2015 to 2017
  • Big dip in 2018, followed by big increase in 2019
  • General decrease from 2019 to 2023
  • These “ball effects” are large

Modeling Approach - Aaron Judge

  • Slugger currently playing for Yankees
  • Broke American League HR record with 62 in 2022
  • Currently has hit 257 HR in career

Aaron Judge in 2022

  • Hit 62 home runs during a season when the ball was relatively dead

  • Raises the question: How many home runs would Judge hit in a different season of the Statcast era?

Methodology

  • Suppose the different season is 2019.

  • Fit a “2019 ball model” that predicts the probability of a HR in 2019 given values of the launch angle and exit velocity.

  • Collect the launch variables for Judge for all balls put into play. For each BIP, predict P(HR) using 2019 ball model.

  • Sum the probabilities – predict the season HR.

Generalized Additive Model

  • Express the logit of the home run probability as \[ \log \left(\frac{P(HR)}{1 - P(HR)}\right) = s(LA, LS) \]

  • \(s()\) is a smooth function of the launch angle (LA) and the launch speed (LS)

  • Generalization of the linear regression model \(y = X \beta + \epsilon\)
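To illustrate the idea of a “ball model” that maps launch variables to a home run probability, here is a deliberately crude stand-in, not a GAM: it estimates P(HR) by the observed home run fraction inside each cell of a (LA, LS) grid. A GAM replaces this step function with a smooth surface \(s(LA, LS)\). The cell widths, the function name, and the toy rows are all hypothetical.

```python
from collections import defaultdict

def fit_binned_ball_model(batted_balls, la_width=4.0, ls_width=3.0):
    """Estimate P(HR | LA, LS) by the HR fraction within each (LA, LS) grid cell."""
    counts = defaultdict(lambda: [0, 0])   # cell -> [home runs, batted balls]
    for la, ls, is_hr in batted_balls:
        cell = (int(la // la_width), int(ls // ls_width))
        counts[cell][0] += int(is_hr)
        counts[cell][1] += 1

    def predict(la, ls):
        cell = (int(la // la_width), int(ls // ls_width))
        hr, n = counts.get(cell, (0, 0))
        # Cells with no training data default to 0.0 (a simplifying choice).
        return hr / n if n else 0.0

    return predict

# Made-up training rows: (launch angle, launch speed, home run?)
train = [(28, 105, True), (29, 106, True), (30, 105, False), (10, 90, False)]
ball_model = fit_binned_ball_model(train)

print(ball_model(28.5, 105.5))
print(ball_model(10, 90))
```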

Predict

  • For each of Judge’s balls in play in 2022, predict the probability of HR from the launch variables using the 2019 ball model.

  • Sum the probabilities – predict total HR count

  • Can get a 90% prediction interval
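A minimal sketch of the summing step, assuming the per-ball home run probabilities are already in hand. The probabilities below are hypothetical placeholders, not Judge’s values. The interval here comes from simulating independent Bernoulli outcomes and so reflects only outcome variability given fixed probabilities; an interval that also carries uncertainty in the fitted ball model would be wider.

```python
import random

def predict_total(probs):
    """Expected home run count: the sum of per-ball HR probabilities."""
    return sum(probs)

def prediction_interval(probs, level=0.90, n_sims=5000, seed=1):
    """Simulation-based interval for the total of independent Bernoulli outcomes."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.random() < p for p in probs) for _ in range(n_sims)
    )
    lo_idx = int((1 - level) / 2 * n_sims)
    hi_idx = int((1 + level) / 2 * n_sims) - 1
    return totals[lo_idx], totals[hi_idx]

# Hypothetical predicted probabilities for a season of balls in play.
probs = [0.8] * 30 + [0.3] * 50 + [0.01] * 300

expected_hr = predict_total(probs)
lo, hi = prediction_interval(probs)
print(f"expected HR: {expected_hr:.1f}")
print(f"90% interval: ({lo}, {hi})")
```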

Results

  • If Judge had been hitting a 2019 ball in 2022, the model predicts he would have hit 75 home runs

  • A 90% prediction interval would be (69, 81)

Repeat this method for other Statcast seasons

  • Use a GAM to predict prob(HR) from the launch angle and exit velocity for one season

  • Use this ball model to predict HR probability using 2022 launch variables

  • Sum prediction probabilities

Results

Takeaway

  • Judge only hit 62 home runs in 2022

  • But if he had played in a different season when the ball was more alive (more carry), his predicted 2022 count would be in the 70s

  • So Judge’s home run achievement is understated

  • Due to this ball bias, we don’t appreciate the magnitude of Judge’s accomplishment

Concluding Comments

  • Two important factors in home run hitting are the hitters (values of launch variables) and the ball (carry or drag coefficient).

  • Batters are stronger and changing their hitting approach, leading to higher rates of “HR friendly” balls in play.

  • The composition of the ball has gone through dramatic changes during the Statcast era.

  • Currently the ball is relatively dead compared to previous seasons.

Sports Analytics

  • Growing application of Statistics – have Statistics in Sports section in American Statistical Association

  • Journals such as Journal of Quantitative Analysis of Sports (JQAS) and the Journal of Sports Analytics (JSA).

  • Conferences such as New England Symposium on Statistics in Sports, Carnegie Mellon Sports Analytics Conference, and Saberseminar.

Great Opportunity For Students to Explore Sports Data

  • Abundance of free sports data available.

  • Tools for working with data (such as R) are readily available.

  • Analysis jobs and internships are available in professional sports teams.

  • College sports teams can benefit from an “analytics coach”.

Learn More?

Analyzing Baseball Data with R

  • 3rd edition (with Max Marchi and Ben Baumer) available this summer

  • Online version available NOW at http://tinyurl.com/abdwr3e

  • Chapters on sources of baseball data, sabermetrics, and R